 command line


Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework

Momeni, Sadegh, Zhang, Ge, Huber, Birkett, Harkous, Hamza, Lipton, Sam, Seguin, Benoit, Pavlidis, Yanis

arXiv.org Artificial Intelligence

Abstract--Despite advancements in machine learning for security, rule-based detection remains prevalent in Security Operations Centers due to the resource intensiveness and skill gap associated with ML solutions. While traditional rule-based methods offer efficiency, their rigidity leads to high false positives or negatives and requires continuous manual maintenance. This paper proposes a novel, two-stage hybrid framework to democratize ML-based threat detection. The first stage employs intentionally loose YARA rules for coarse-grained filtering, optimized for high recall. To overcome data scarcity, the system leverages Simula, a seedless synthetic data generation framework, enabling security analysts to create high-quality training datasets without extensive data science expertise or pre-labeled examples. A continuous feedback loop incorporates real-time investigation results to adaptively tune the ML model, preventing rule degradation. This proposed model with active learning has been rigorously tested for a prolonged time in a production environment spanning tens of thousands of systems. The system handles initial raw log volumes often reaching 250 billion events per day, significantly reducing them through filtering and ML inference to a handful of daily tickets for human investigation. Live experiments over an extended timeline demonstrate a general improvement in the model's precision over time due to the active learning feature. This approach offers a self-sustained, low-overhead, and low-maintenance solution, allowing security professionals to guide model learning as expert "teachers". Despite significant advancements in machine learning (ML) for security, traditional rule-based detection remains the predominant approach in enterprise security operations. 
This is evidenced by the low adoption rate of ML-based technologies in Security Operations Centers (SOC), with one study [1] finding that only 10% of participating SOCs utilized AI/ML security monitoring tools.
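
The two-stage idea described in this abstract (a loose, high-recall filter followed by a model that scores what survives) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: plain regular expressions stand in for YARA rules, `toy_score` is a placeholder for a trained classifier, and the patterns, threshold, and sample events are hypothetical.

```python
import re
from typing import Callable, Iterable, List

# Stage 1: deliberately loose patterns. Regexes stand in for YARA rules here;
# the patterns themselves are hypothetical examples.
LOOSE_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (r"powershell", r"curl\s", r"base64")
]


def coarse_filter(events: Iterable[str]) -> List[str]:
    """Keep any event that matches at least one loose pattern (favours recall)."""
    return [e for e in events if any(p.search(e) for p in LOOSE_PATTERNS)]


def toy_score(event: str) -> float:
    """Placeholder for a trained model: fraction of suspicious tokens present."""
    suspicious = ("-enc", "| sh", "hidden")
    lowered = event.lower()
    return sum(tok in lowered for tok in suspicious) / len(suspicious)


def triage(
    events: Iterable[str],
    score: Callable[[str], float] = toy_score,
    threshold: float = 0.3,  # hypothetical cut-off
) -> List[str]:
    """Stage 2: score only the events that survived the coarse filter."""
    return [e for e in coarse_filter(events) if score(e) >= threshold]


events = [
    "ls -la /tmp",
    "powershell -enc SQBFAFgA -WindowStyle Hidden",
    "curl https://example.com/readme.txt",
]
tickets = triage(events)
print(tickets)
```

In this toy run the coarse filter passes two of the three events, and only the encoded PowerShell command clears the scoring threshold. The feedback loop from the abstract would correspond to retraining whatever replaces `toy_score` on analyst verdicts, which this sketch does not attempt.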


A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver

Aguirre, Jacob, Cifuentes, Diego, Guigues, Vincent, Monteiro, Renato D. C., Nascimento, Victor Hugo, Sujanani, Arnesh

arXiv.org Artificial Intelligence

We present a Julia-based interface to the precompiled HALLaR and cuHALLaR binaries for large-scale semidefinite programs (SDPs). Both solvers are established as fast and numerically stable, and accept problem data in formats compatible with SDPA and a new enhanced data format taking advantage of Hybrid Sparse Low-Rank (HSLR) structure. The interface allows users to load custom data files, configure solver options, and execute experiments directly from Julia. A collection of example problems is provided, including the SDP relaxations of the Matrix Completion and Maximum Stable Set problems.


Get the macOS Finder to Do Just About Anything by Typing Natural Language Commands

WIRED

I'm genuinely not sure if large language models--often referred to as "AI" in shorthand--are the future of computing. But I also don't think chatbots are how people will use this technology in the years to come. Substage, an indie Mac application by developer Joseph Humfrey, is a simple app that points to a potential alternative--one that's useful right now. This application floats under every Finder window, meaning you see it only when you're browsing files in macOS. You can type English-language sentences into it to do things like rename, convert, or compress files.


CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research

Huang, Sian-Yao, Yang, Cheng-Lin, Lin, Che-Yu, Huang, Chun-Ying

arXiv.org Artificial Intelligence

This research addresses command-line embedding in cybersecurity, a field obstructed by the lack of comprehensive datasets due to privacy and regulation concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set is generated using a set of large language models (LLMs) comprising 28,520 similar command-line pairs. Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, enabling the computation of semantic similarity with command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval). Our study explores the feasibility of data generation using LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, embedding models' weights and all program code to the public. This advancement paves the way for more effective command-line embedding for future researchers.


Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search

Oliver, Jonathan, Batta, Raghav, Bates, Adam, Inam, Muhammad Adil, Mehta, Shelly, Xia, Shugao

arXiv.org Artificial Intelligence

"Alert fatigue" is one of the biggest challenges faced by the Security Operations Center (SOC) today, with analysts spending more than half of their time reviewing false alerts. Endpoint detection products raise alerts by pattern matching on event telemetry against behavioral rules that describe potentially malicious behavior, but can suffer from high false positives that distract from actual attacks. While alert triage techniques based on data provenance may show promise, these techniques can take over a minute to inspect a single alert, while EDR customers may face tens of millions of alerts per day; the current reality is that these approaches aren't nearly scalable enough for production environments. We present Carbon Filter, a statistical learning based system that dramatically reduces the number of alerts analysts need to manually review. Our approach is based on the observation that false alert triggers can be efficiently identified and separated from suspicious behaviors by examining the process initiation context (e.g., the command line) that launched the responsible process. Through the use of fast-search algorithms for training and inference, our approach scales to millions of alerts per day. Through batching queries to the model, we observe a theoretical maximum throughput of 20 million alerts per hour. Based on the analysis of tens of millions of alerts from customer deployments, our solution resulted in a 6-fold improvement in the Signal-to-Noise ratio without compromising on alert triage performance.


Intrusion Detection at Scale with the Assistance of a Command-line Language Model

Lin, Jiongliang, Guo, Yiwen, Chen, Hao

arXiv.org Artificial Intelligence

Intrusion detection is a long-standing and crucial problem in security. A system capable of detecting intrusions automatically is in great demand in enterprise security solutions. Existing solutions rely heavily on hand-crafted rules designed by security operators, which suffer from high false negative rates and poor generalization ability to new, zero-day attacks at scale. AI and machine learning offer promising solutions to address the issues, by inspecting abnormal user behaviors intelligently and automatically from data. However, existing learning-based intrusion detection systems in the literature are mostly designed for small data, and they lack the ability to leverage the power of big data in cloud environments. In this paper, we target this problem and introduce an intrusion detection system which incorporates large-scale pre-training, so as to train a large language model based on tens of millions of command lines for AI-based intrusion detection. Experiments performed on 30 million training samples and 10 million test samples verify the effectiveness of our solution.


IsoEx: an explainable unsupervised approach to process event logs cyber investigation

Lavieille, Pierre, Atlas, Ismail Alaoui Hassani

arXiv.org Artificial Intelligence

39 seconds. That is the time lapse between two consecutive cyber attacks as of 2023. Meaning that by the time you are done reading this abstract, about 1 or 2 additional cyber attacks would have occurred somewhere in the world. In this context of highly increased frequency of cyber threats, Security Operation Centers (SOC) and Computer Emergency Response Teams (CERT) can be overwhelmed. In order to relieve the cybersecurity teams in their investigative effort and help them focus on more added-value tasks, machine learning approaches and methods started to emerge. This paper introduces a novel method, IsoEx, for detecting anomalous and potentially problematic command lines during the investigation of contaminated devices. IsoEx is built around a set of features that leverages the log structure of the command line, as well as its parent/child relationship, to achieve greater accuracy than traditional methods. To detect anomalies, IsoEx resorts to an unsupervised anomaly detection technique that is both highly sensitive and lightweight. A key contribution of the paper is its emphasis on interpretability, achieved through the features themselves and the application of eXplainable Artificial Intelligence (XAI) techniques and visualizations. This is critical to ensure the adoption of the method by SOC and CERT teams, as the paper argues that the current literature on machine learning for log investigation has not adequately addressed the issue of explainability. This method was proven efficient in a real-life environment as it was built to support a company's SOC and CERT


Apple's VisionOS Makes a Bold Leap in Computer Interface

WIRED

Like everyone else who got to test Apple's new Vision Pro after its unveiling at the Worldwide Developers Conference in Cupertino, California, this week, I couldn't wait to experience it. But when an Apple technician at the ad hoc test facility used an optical device to check out my prescription lenses, I knew that there might be a problem. The lenses in my spectacles have prisms to address a condition that otherwise gives me double vision. Apple has a set of preground Zeiss lenses to handle most of us who wear glasses, but none could address my problem. In any case, my fears were justified: When I got to the demo room, the setup for eye-tracking--a critical function of the device--didn't work. I was able to experience only a subset of the demos.


ChatGPT Sparked a New AI Race and Revived the Popularity of Text Boxes

#artificialintelligence

Not even OpenAI comes close. Before it became the fastest-growing consumer app in history, before it popularised the phrase "generative pre-trained transformers," and before every company you can think of was racing to adopt its underlying model, ChatGPT debuted in November as a "research preview." In this article, we explain how ChatGPT sparked a new AI race and revived the popularity of text boxes. The blog post that announced ChatGPT has since become a hilarious case study in underselling.


How to Identify Fuzzy Duplicates in Your Tabular Dataset

#artificialintelligence

Imagine you have a dataset with over a million records that may contain some fuzzy duplicates. The simplest and most intuitive approach that many come up with involves comparing every pair of records. However, this quickly becomes infeasible as the size of your dataset grows. Even if we assume a decent speed of 10,000 comparisons per second, it will take roughly three years to complete. CSVDedupe is an ML-based open-source command-line tool that identifies and removes duplicate records in a CSV file.
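
The "roughly three years" estimate can be checked with a few lines of arithmetic. The sketch below assumes exactly one million records and a 365-day year, and the function name is made up for illustration. It computes the figure in two ways, because the excerpt does not say whether each pair is compared once or in both directions.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def years_to_compare(n_records: int, comparisons_per_second: int, ordered: bool = True) -> float:
    """Estimate the wall-clock years needed for brute-force pairwise comparison.

    ordered=True counts n * n comparisons (every record against every record);
    ordered=False counts each unordered pair once, n * (n - 1) / 2.
    """
    if ordered:
        comparisons = n_records * n_records
    else:
        comparisons = n_records * (n_records - 1) // 2
    return comparisons / comparisons_per_second / SECONDS_PER_YEAR


all_pairs = years_to_compare(1_000_000, 10_000)
unique_pairs = years_to_compare(1_000_000, 10_000, ordered=False)
print(f"n^2 comparisons:    {all_pairs:.2f} years")
print(f"unique pairs only:  {unique_pairs:.2f} years")
```

Under these assumptions the n-squared count comes to a little over three years, which matches the figure quoted above, while comparing each unordered pair once takes about half as long. Either way the cost grows quadratically with the number of records, which is the reason brute-force comparison stops being practical.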